What if you had data that looked like this? It’s square, there’s clear edges that define the classes and it’s non-linear. It would be difficult to mathematically represent this data using a linear model like linear regression, logistic regression, glm, etc.
You could fit a logistic regression to this model.
mod <- glm(Y ~ X1 + X2, data = .data, family = 'binomial')
preds <- predict(mod, type = 'response') > 0.5
And if you know exactly what you were doing, you could add an interaction and get slightly better results.
mod <- glm(Y ~ X1 * X2, data = .data, family = 'binomial')
preds <- predict(mod, type = 'response') > 0.5
But knowing that functional form is difficult, especially in real-world high-dimensional datasets. Decision trees over
mod_tree <- rpart::rpart(Y ~ X1 + X2, data = .data)
preds <- predict(mod_tree) > 0.5
We will start at the lowest building block of the decision trees – the impurity metric – and build up from there.
And then you can extend the tree model into more complex models like bagging, random forest, and XGBoost: 4. Program a bagging model by implementing many decision trees and resampling the data 5. Program a random forest model by implementing many decision trees, resampling the data, and sampling from the columns 6. XGB
Binary decision trees create an interpretable decision-making framework for making a single prediction. Suppose a patient comes into your clinic with chest pain and you wish to diagnose them with either a heart attack or not a heart attack. A simple framework of coming to that diagnosis could look like the below diagram. Note that each split results in two outcomes (binary) and every possible condition leads to a terminal node.
The model’s splits can also be visualized as partitioning the feature
space. Since the decision tree makes binary splits along a feature, the
resulting boundaries will always be rectangular. Further growing of the
above decision tree will result in more but smaller boxes. Additional
features (X1, X2, ...) will result in additional dimensions
to the plot.
But where to split the data? The splits are determined via an
impurity index. With each split, the algorithm maximizes the purity of
the resulting data. If a potential split results in classes
[HA, HA] and [NHA, NHA] then that is chosen
over another split that results in [HA, NHA] and
[NHA, HA]. At each node, all possible splits are tested and
the split that maximizes purity is chosen.
For classification problems, a commonly used metric is Gini impurity.
Gini impurity is 2 * p * (1 - p) where p is
the fraction of elements labeled as the class of interest. A value of
0 is a completely homogeneous vector while 0.5 is the
inverse. The vector [NHA, HA, NHA] has a Gini value of
2 * 1/3 * 2/3 = 0.444. Since Gini is used for comparing
splits, a Gini value is calculated per each resulting vector and then
averaged – weighted by the respective lengths of the two vectors.
Trees will struggle when the parameter space is dissected at an angle by the classification value. Since regression trees are partitioning the parameter space into rectangles, the tree will need to be deeper to approximate the decision boundary.
The below data’s classification is in two separate triangles: top left and bottom right of the plot. A logistic regression finds the boundary easily.
# decision tree
mod_tree <- rpart::rpart(Y ~ X1 + X2, data = .data, control = rpart::rpart.control(maxdepth = 2))
preds <- predict(mod_tree) > 0.5
# logistic regression
model_log <- glm(Y ~ X1 + X2, data = .data, family = 'binomial')
preds <- predict(model_log, type = 'response') > 0.5
Single decision trees are prone to overfitting and can have high variance on new data. A simple solution is to create many decision trees based on resamples of the data and allow each tree to “vote” on the final classification. This is bagging. The process keeps the low-bias of the single tree model but reduces overall variance.
The “vote” from each tree is their prediction for a given observation. The votes are averaged across all the trees and the final classification is determined from this average. The trees are trained on bootstrapped data – taking repeated samples of the training data with replacement.
Random forest is like bagging except in addition to bootstrapping the observations, you also take a random subset of the features at each split. The rule-of-thumb sample size is the square root of the total number of features.
credit <- readr::read_csv(file.path(here::here(), 'workshop', 'data', 'credit_card.csv'))
Benefits of tree methods:
Downsides: